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Abstract A parallel 2D+1 split-step Fourier method with Crank-Nicholson 
scheme running on multi-core shared memory architectures is developed to 
study the propagation of ultra-short high-intensity laser pulses in air. The 
parallel method achieves a near linear speed-up with results for the efficiency 
of more than 95% on a 24-core machine. This method is of great potential 
application in studying the long-distance propagation of the ultra-short high 
intensity laser pulses. 
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1 Introduction 

Investigation of the ultra-short high-intensity laser pulses propagation in air 
has been a hot topic in recent years due to its physical interest as well as its 
potential applications. It is very important to predict well how the electromag¬ 
netic field of the pulse evolves as it propagates [T]. Although some analytical 
solutions with some approximations can be found in most applications 

the analytic approximations can not describe accurately the evolution of the 
pulses and we have to resort to numerical methods. Split-step Fourier meth¬ 
ods [5] with the Crank-Nicholson scheme (FCN) in the transverse direction is 
often employed to numerical calculate the propagation of the ultra-short laser 
pulses via solving the nonlinear Schrbdinger equation (NLSE) [5] . 

Ultra-short high intensity laser pulse can convey high intensity over ex¬ 
tended distances, and some applications need kilometer-range calculation. The 
filamentation induced by the ultra-short high-intensity pulses have been ob¬ 
served over several kilometers [B]. The alternative signs in the coefficients of 
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the high-order Kerr effects allow the pulses to propagate without much energy 
loss in a long distance mil]. However, the simulation for a long-distance prop¬ 
agation of the pulse is very time-consuming. For example, it may need several 
weeks to calculate a kilometer-range propagation. Parallel Split-step Fourier 
methods piiniin] have been developed to solve lD-l-1 NLSE. However, the 
lD-l-1 NLSE can not be used to simulate ultra-short high intensity laser pulse 
propagation in air, since it is unable to describe the transverse variations of 
the laser pulse. 

In this paper we develop a parallel FCN method to solve a 2D-f 1 NLSE, 
which can be employed to simulate ultra-short high intensity laser pulse propa¬ 
gation in air. The paper is organized as follows. In section 2 we briefly introduce 
the NLSE which describes the ultra-short high-intensity laser pulses propaga¬ 
tion in air. In section 3 the serial 2D-I-I FCN method for solving the NLSE 
equation is reviewed. Section 4 presents the parallel FCN algorithm for the 
2D-I-1 NLSE. Performance tests and a simulation for a long-distance pulse’s 
propagation are given in section 5. Conclusion is detailed in Section 6 . 


2 Nonlinear propagation equation 


The wave equation for the laser electric field E(r, t) is given by [5] 

02 1 02 


( 1 ) 


where 8 ^( 2 ;, y, z, t), and Snl{x, y, z, t) are the linear and nonlinear source of 
the electric field. E(a;, y, z, t), Sl(x, y, z, t), and S]vl( 2 ;, y, z, t) can be written 
as [ 2 ] 


E(a:, y, z, t) = y, z, t) exp {ikoZ - iuJot)ex + c.c , (2a) 

Sl{x, y, z, t) = y,z, t)exp{ikoZ - iuJot)ex + c.c , (2b) 

Snl{x, y, z, t) = ^Snl{x, y, z, t)exp {ikgZ - iuJot)ex + c.c , (2c) 


where c.c denotes the complex conjugate of the first term in the right hand side 
of equations, kg is the carrier wave number, ojg is the angular frequency of the 
pulse, Cx denotes the unit vector in the direction of polarization. A(x, y, z, t), 
SLix,y, z,t) and SNLix, y, z, t) are the complex amplitudes of E(a;,y,z,t), 
Sl^x, y, z, t) and Snl^x, y, z, t) respectively. 

For the ultra-short and high-intensity laser pulse, Sml can be written as 


Snl{x, y, z, t) 


^Kerr H” ^Plasma + Sic 


( 3 ) 


where SKerr denotes the nonlinear contribution from bound electrons, i.e., 
Kerr effect, 

ah 

SKerr{r,t) = - n2xj\A{r, t)\^A{r, t) , 

ZIq 


( 4 ) 




Parallel simulation for the ultra-short laser pulses’ propagation in air 


3 


here n 2 xj is Kerr nonlinear refractive index. The nonlinear index defines a 

critical nonlinear self-focusing power Per = ^Jnon 2 ^ gaussian input pulse. 
The plasma source term Spiasma is given by 

UJ^ 

Spiasmai^, t) = t) , (5) 

OJo 

where ojpe = p/mgeo )^denotes the plasma frequency, qe is the plasma 
density generated by ionization, rUe is the mass of electron and eg is the 
vacuum permittivity. 

The term Sion describes the depletion of laser energy due to ionization 

S^onir, t) = -ikoP^^^\A{r, t) , (6) 

where is the coefficient of multiphoton ionization for the number of pho¬ 
tons K. When the wavelength is 800 nm, K = 10 and = 

1.27 X 10-126(,jj^l7/y^9^ p] 

Substituting Eqs. (2)-(|ni) into Eq. ([T]) and transforming the independent 
variables from z,t to z,t via r = t — zjvg with Vg the linear group veloc¬ 
ity of the pulse, and applying the slowly varying envelope approximation 
d^Ajdz^ = 0, we can obtain the nonlinear schrddinger equation describing 
the propagation of ultra-short and high-intensity laser pulse as follow 


dz 




-AiA- 


ikg ^pe ^ 
2 ujo 


ik A 


ikp 

no 


4 

:e 

i=i 


n2* 


\Ar^)A 


2 


lA^-^A, 


( 7 ) 


where Z\j_ = jdr^ -V —d/dr for cylindrically symmetric beams or A± = 
d/dx^ + d/dy^ otherwise. In this paper the initial input pulse is chosen to 
have cylindrical symmetry, k =0.2 fs^/cm is the second order dispersion co¬ 
efficient. 

The rate equation for electron density qe can be written as 


dt 



( 8 ) 


where pat = 2.7 x 10^® cm“^ is the density of the neutral atoms. 

The core contribution of this work is to built a 2D-I-1 (time) parallel ECN 
solver for Eq.© and Eq.©. 


3 FCN method 

We restrict our attention to the axis-symmetric problems thus A = A(r, z, r). 
Uniform discrete lattice approximations is employed with lattice spacings Ar, 
Az, and At. A{r,z,T) is discretized into A{mAr,nAz,pAT+ 
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Tmin), in which 0 ^ m < Nr € N, 0 ^ n < € N, and 0 ^ p < Nt G N. In 

Fourier domain, A{r, z,uj) can be discretized into A{mAr, nAz,ujp). 

Let A'^ p denote A{mAr, nAz, pAr+Tmin) and denote A{mAr, nAz, cOp). 
Let A'^p represent the vector pj represent the vector A^ 

and represent the matrix equation ([7]) can 

be written as 

Ft A ^ ^ 

^ = (^ + iVM, (9) 

where D and N are the linear and nonlinear operators respectively, 


D = 


N = 


f 


2ko \dr^ 
4 


r dr 


ikp 

no 


\3 = 1 / 


ik 

2 dt^’ 

ikp a^pe 
2 ujp 




( 10 ) 

(11) 


In the step n, we first calculate the nonlinear part for a half step {Az/2), 
and then calculate the linear part for a full step Az, and finally calculate the 
nonlinear part for another half step {Az/2), i.e., 

4"+^ = exp()V^) exp(Szlz) exp()V^), (12) 

In the below we describe the algorithm in detail. 


3.1 The linear part, from A^ to 


The linear part is solved in the frequency domain. Firstly, A” in the time 

domain are transformed to = T{AN ) in the frequency domain. Here T 
denote Fourier transform. It follows Eq. m that the linear-part effect satisfies 


^ - _L 



A-—{tuj)^A . 
2 ^ ^ 


(13) 


Secondly, we discretize Eq. m using Crank-Nicholson scheme which is an 
implicit finite-difference method and is given by 


dA 


n+i 


dz 

^ -^m^p 

dr^ 

■L ■ri.m.v 


jn+i _ An 
■^m,p -^m,p 


2Ar^ 

An _ An 


(14) 


1 ,p 


A-n+l _ n An+l , A^+l 


2Ar^ 


(15) 


AU+l 

^m+l,p 


— 4"+-* 


/m{Ar)^ 


r dr 


(16) 











Parallel simulation for the ultra-short laser pulses’ propagation in air 


5 


Substituting Eqs. (O-dini) into Eq. m, and making use of the boundary 
conditions 


Ml 

dr 


'—0 — 0 , ^\r—max — 0 


we can obtain a matrix equation 


where 


/LfP A-n+l _ i\xp An 


(17) 

(18) 




MLf = 


bo Co 

ai bi Cl 

aNr-2 bNr-2 CNr-2 
^Nr-l bNr-1 . 

Co fo 

di Cl fi 

dNr-2 CNr-2 fNr-2 
dWr-l CNr-1 


(19) 


with 


iAz 


4koAr^ 
aNr-l = 0 , 

bo = 1 + 




bj = 1 


koAr^ 

iAz 


2koAr^ 
bNr-l = 1 ) 
iAz 

Co = 


, J&[l,Nr- 2 ] , 


c, = - 


koAr^ 

iAz 

4koAr^ 

iAz 




4koAr^ 
dNr-i = 0 , 




Co = 1 - 


iAz 


2koAr^ 

CNr-l = 0 , 


koAr^ 4 

iAz ik oj^Az 


4 


, jelJ,Nr-2] , 
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iAz 


4koAr^ 


iAz 



Finally, can be obtained from via inverse Fourier transforma¬ 


tion 



( 20 ) 


The pseudo-codes for the linear-part effects of the step n are listed as 
follows: 

first loop^ m=0,...,Nr — l 



end first loop 

second lonn- n — n — 1 



end second loop 


third loop: m = 0 ,nr — 1 



end third loop 

It is worth pointing out that the second loop involves a triangular matrix 
equation and thus can be solved efficiently via a chasing method. 

3.2 The nonlinear part 

The calculations of the nonlinear-part effects are divided into two stages, which 
can be calculated by 



( 21 ) 


and 



( 22 ) 


The calculations of Eqs. (EID and (1221) involves the plasma density, which 
can be obtained via solving Eq. ([5]) with a fourth-order Runge-Kutta method. 
The pseudo-codes for the nonlinear-part effects in the first half step are listed 
as follows: 

outer loop: m = 0,Nr — 1 
inner loop: p = 0 ,..., Nt — 1 


calculate electron density 
calculate N 



end inner loop 
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end outer loop 

The pseudo-codes for the another half step are same as that for the first 
half step, and we do not repeat here. 

In summary, the serial 2D-f 1 FCN method can be carried out by the fol¬ 
lowing steps: 

1. Input the initial data, e.g., the initial input pulse, the index of the NLS 
equation, the ranges of time and space, and the grid steps Ar, At, Az. 

2. Calculate the triangular matrix and for the discretized frequency 
ujp in Fourier domain. 

3. loop: n = 0 ,..., — 1 

Calculate the first half step of the nonlinear part. 

Calculate the linear part, from T" to . 

Calculate the second half step of the nonlinear part, 
end loop 


4 Parallel algorithm for FCN method 


Suppose we have P threads to carry out the FCN method in the simulations. 
Set the thread’s id to 1, 2,3, • • •, P. Let Rb{i) and Re{i) denote the begin grid 
number and the end grid number in the radial domain for the i*’*' thread. Let 
Tb(i) and Te(i) denote the begin grid number and the end grid number in the 
temporal domain for the i*’*' thread. In order to achieve the optimal parallel 
efficiency, the four arrays Rb, Re, Tb, and Te are set as follows 


Rb[i)=' 

Re{i)=< 

Tb{ih< 

Te{i)=' 


{l-l)\Nr/P}, ii^{Nr%P)Pl, 

{Nr%P){\NrlP} - [Nr/P\) + {i - l)lNr/P\ , Others , 


t\Nr/P]-l , 

{N,%P){\Nr/P^ - [N,/P\) + i[N,/P\ - 1 
{i-l)\Nb/P^ , 

{Nt7bP){\Nt|P^ - VNt/P\) + (* - l)[Nt/P\ 
i\Nt/P]-l , 

{Nt%P){\Nt/P^ - [Nt/P\) + i[Nt/P\ - 1 


i ^ {Nr%P) , 
others , 

* < {Nt%P) + l , 
others , 

z < {Nt%P) , 
others , 


where [ ] denotes the rounding-up (ceiling) operation, [ J denotes the 
rounding-down (floor) operation, and % denotes the modulo operation. 

In the step n, the calculations for the discrete (reverse) Fourier transforms 
and the nonlinear part are decomposed in the radial domain (Fig. [T] (a)), and 
the calculations for solving the linear equations m are decomposed in the 
temporal domain (Fig. [T] (b)). 

The pseudo-codes for the linear-part effects of the step n in the thread 
are listed as follows: 


first loop: m = Rb{i),Re{i) 



Cunliang Ma, Wenbin Lin* 




Fig. 1 Data decomposition, (a) decomposition in radial domain; (b) decomposition in tem¬ 
poral domaim. 


end first loop 

Waiting until all other P-1 threads finish the first loop 
second loop: p = Te{i) 



end second loop 

Waiting until all other P-1 threads finish the second loop 
third loop: m = Rb{i), ■■■, Reii) 

An+l — ) 

end third loop 

The pseudo-codes for the nonlinear-part effects of the first-half step n in 
the i**' thread are listed as follows: 

outer loop: m = Rb{i), ■■■, Re{i) 
inner loop: p = 0,Nt — 1 
calculate electron density 
calculate N 

end inner loop 
end outer loop 

The pseudo-codes for the another half step are same as that for the first 
half step, and we do not repeat here. 

The above parallel algorithm for the FCN method is built directly following 
the serial one. In order to better suite parallel programming, we re-organize 
the parallel algorithm into three parts basing on the data decomposition (see 
Fig. m. The pseudo-codes for the the step in the thread are listed as 
follows, 

outer loop: m = Rb{i), ■■■, Re{i) 
inner loop: p = 0,Nt — 1 
calculate electron density 
calculate N 
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^m,p ^ ^m,pexp(^N) 

end inner loop 

az,. = Haz,.) 

end outer loop 

Waiting until all other P-1 threads finish the corresponding loop 
loop: p = Tb{i),Te{i) 



end loop 

Waiting until all other P — 1 threads finish the corresponding loop 
outer loop: m = Rh{i), ■■■, Re{i) 

iri+l _ 17-1 (~An+l\ 

inner loop: p = 0,Nt — 1 
calculate electron density 
calculate N 



end inner loop 
end outer loop 



Fig. 2 The re-organized procedure basing on the data decomposition. 


Suppose a global integer variable Sync has the initial value 0, and a global 
mutex Mutex. The waiting other P — 1 threads can described as follows: 

lock(MMtea;) 

Sync -I- + 

\m\oc^{Mutex) 

loop 

if Sync%P==0 
break 
end if 
end loop 


5 Numerical experiments 

The specifications of the computer which is used for the numerical experiments 
are as follows 
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1. Software Environment: 

Operating System: CentOs 6.3 
Development platform: g++, Pthread 

2. Hardware Environment: 

CPU: Intel(R) Xeo(R) E7 - 4807 @ 1.87 GHz 
CPU: Cores: 24 

The initial laser pulse we consider is assumed to be a Gaussian beam m 



where rg is the beam width, Tp is the temporal half width, Ig is the input 
peak intensity, and C denotes the chirp of the incident pulse. 


5.1 Timings and accuracy 

The performance of the parallel program is measured by Speedup, which is 
defined as the ratio between sequential execution time and parallel execution 
time [T^ . 

^ , Sequential execution time 

Speedup = ^^- . (24) 

Parallel execution time 

A numerical example {Nr = 2112, Nt = 2048, and = 1000) is tested 
with different thread numbers, and Table 2 presents the comparison for the 
computational time with different thread numbers. 


Thread number 

Time (s) 

Speedup 

1 

5368.03 

1 

4 

1343.821 

3.99 

8 

671.937 

7.99 

12 

450.097 

11.93 

16 

344.076 

15.60 

20 

279.084 

19.23 


Table 1 Computational time and speed-up ratio for different thread numbers. 


For the accuracy, we have checked that all the simulation results of the 
parallel code with different threads are the same as that of the serial code. 


5.2 One application 

We employ the parallel algorithm to simulate the propagation of the ultra- 
short laser pulse in air for 1.1 kilometers. In the simulation, Iq = 3.14 x 10^^ 
W/m^, rg = 18 mm, Tg = 300 fs, and C = 0. The number of grids W = 
1350, Nt = 1024, and = 1.1 x 10®. Fig. |2] presents the evolution of the 
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on-axis intensity and the fiuence profile of the beam. It only takes the parallel 
program with 20 threads less than 3 days to do the simulation, in contrast, it 
would require about two months for a serial code to do the same work. 



P""") z(m) 


Fig. 3 Propagation of a laser pulse in the atmosphere. The input beam has a Gaussian 
shape with Iq = 3.14 X 10^^ W/m^, rg = 18 mm, rg = 300 fs, and C = 0 . (a) On-axis 
Intensity at different propagation distance, (b) The Fiuence profile as the function of the 
propagation distance. 


6 Conclusion 

In this paper, a parallel 2D-I-1 FCN method is developed which has been tested 
on multi-core shared memory architectures. The simulation results shows that 
the speed-up ratio is more than 19.2 when the thread number is 20. The 
parallel FCN algorithm is of great importance in the simulations for the long¬ 
distance propagation of the ultra-short laser pulse, which is very useful to 
many applications such as lightning control, remote sensing, and so on. 
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